8 research outputs found

    An unsupervised perplexity-based method for boilerplate removal

    Get PDF
    The availability of large web-based corpora has led to significant advances in a wide range of technologies, including large-scale retrieval systems and deep neural networks. However, leveraging this data is challenging, since web content is plagued by so-called boilerplate: ads, incomplete or noisy text, and remnants of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate that perplexity is a valuable signal that contributes to both effectiveness and efficiency. In fact, removing the noisy parts leads to lighter AI or search solutions that remain effective and yield substantial savings in resources. We illustrate the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.
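    The core idea, scoring text segments with a language model and discarding those whose perplexity is high, can be sketched with a toy unigram model. This is an illustrative assumption, not the authors' implementation (which relies on pre-trained models); the `train_unigram`, `perplexity`, and `remove_boilerplate` names and the add-one smoothing are hypothetical choices for the sketch:

    ```python
    import math
    from collections import Counter

    def train_unigram(corpus_tokens):
        # Estimate a unigram model from tokens of known clean text.
        counts = Counter(corpus_tokens)
        total = sum(counts.values())
        # Add-one smoothing over the observed vocabulary plus an <unk> slot.
        return {"counts": counts, "total": total, "vocab": len(counts) + 1}

    def perplexity(model, tokens):
        # Perplexity = exp of the average negative log-likelihood per token.
        if not tokens:
            return float("inf")
        log_prob = 0.0
        for tok in tokens:
            p = (model["counts"].get(tok, 0) + 1) / (model["total"] + model["vocab"])
            log_prob += math.log(p)
        return math.exp(-log_prob / len(tokens))

    def remove_boilerplate(segments, model, threshold):
        # Keep segments that look like well-formed text under the clean-text
        # model; menus, ads, and navigation debris score a higher perplexity.
        return [s for s in segments if perplexity(model, s.split()) <= threshold]
    ```

    A segment built from in-domain vocabulary will score a markedly lower perplexity than a list of navigation labels, which is what makes a simple threshold usable as a filter.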

    Collaboration between lecturers at a German and a Spanish university to develop practical seminars on the credibility of information

    Get PDF
    In this teaching experience, we present an international collaboration between researchers from the University of Regensburg (Bavaria, Germany) and the University of Santiago de Compostela (Galicia, Spain), in which students participate in two Information Science courses. An approach is proposed in which students build their own understanding of technologies by posing a problem in the context of a research project. The area of interest in which the challenge is proposed is oriented towards helping end users of information technologies to establish the credibility of information. To this end, work has been done with online content related to the COVID-19 pandemic. This work was funded by FEDER/Ministerio de Ciencia, Innovación y Universidades – Agencia Estatal de Investigación (RTI2018-093336-B-C21), and also received financial support from the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19) and from ERDF funds, which recognize the CiTIUS of the USC as a Research Center of the Galician University System.

    Personality trait analysis during the COVID-19 pandemic: a comparative study on social media

    Get PDF
    The COVID-19 pandemic, a global contagion of coronavirus infection caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has triggered severe social and economic disruption around the world and provoked changes in people's behavior. Given the extreme societal impact of COVID-19, it becomes crucial to understand the emotional response of the people and the impact of COVID-19 on personality traits and psychological dimensions. In this study, we contribute to this goal by thoroughly analyzing the evolution of personality and psychological aspects in a large-scale collection of tweets extracted during the COVID-19 pandemic. The objectives of this research are: i) to provide evidence that helps to understand the estimated impact of the pandemic on people's temperament, ii) to find associations and trends between specific events (e.g., stages of harsh confinement) and people's reactions, and iii) to study the evolution of multiple personality aspects, such as the degree of introversion or the level of neuroticism. We also examine the development of emotions, as a natural complement to the automatic analysis of the personality dimensions. To achieve our goals, we have created two large collections of tweets (geotagged in the United States and Spain, respectively), collected during the pandemic. Our work reveals interesting trends in personality dimensions, emotions, and events. For example, during the pandemic period, we found increasing traces of introversion and neuroticism. Another interesting insight from our study is that the most frequent signs of personality disorders are those related to depression, schizophrenia, and narcissism. We also found some peaks of negative/positive emotions related to specific events. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. The authors thank the support obtained from: i) project PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next GenerationEU), ii) project PID2022-137061OB-C22 (Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Proyectos de Generación de Conocimiento; supported by the European Regional Development Fund), and iii) Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2022/19) and the European Regional Development Fund, which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System.

    PoS tagging and Named Entity Recognition in a Big Data environment

    Get PDF
    This article describes a suite of linguistic modules for the Spanish language based on a pipeline architecture, which contains tasks for PoS tagging and Named Entity Recognition and Classification (NERC). We have applied run-time parallelization techniques in a Big Data environment in order to make the suite of modules more efficient and scalable, and thereby to reduce computation time in a significant way. Therefore, we can address problems at Web scale. The linguistic modules have been developed using basic NLP techniques in order to easily integrate them in distributed computing environments. The qualitative performance of the modules is close to the state of the art. This work was funded by the projects HPCPLN - Ref: EM13/041 (Programa Emergentes, Xunta de Galicia), Celtic - Ref: 2012-CE138, and Plastic - Ref: 2013-CE298 (Programa Feder-Innterconecta).
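    The run-time parallelization idea, distributing documents across workers so the per-document pipeline scales, can be sketched as follows. The `tag_document` tagger is a trivial placeholder (the suite's actual morphosyntactic and NERC modules, and its Big Data framework, are not reproduced here):

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def tag_document(doc):
        # Placeholder tagger: the real pipeline would run morphosyntactic
        # analysis (PoS tagging) and NERC on each token.
        return [(tok, "PROPN" if tok[:1].isupper() else "X") for tok in doc.split()]

    def tag_corpus(docs, workers=4):
        # Map the per-document pipeline over the collection in parallel;
        # a production setup would use a distributed framework instead.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(tag_document, docs))
    ```

    Because each document is processed independently, the same map step can be handed to a distributed engine without changing the per-document logic, which is what makes the pipeline architecture easy to scale.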

    Reliability Prediction for Health-related Content: A Replicability Study

    No full text
    Determining the reliability of online data is a challenge that has recently received increasing attention. In particular, unreliable health-related content has become pervasive during the COVID-19 pandemic. Previous research has approached this problem with standard classification technology using a set of features that have included linguistic and external variables, among others. In this work, we aim to replicate parts of the study conducted by Sondhi and his colleagues using our own code, and make it available for the research community. The performance obtained in this study is as strong as the one reported by the original authors. Moreover, their conclusions are also confirmed by our replicability study. We report on the challenges involved in replication, including that it was impossible to replicate the computation of some features (since some tools or services originally used are now outdated or unavailable). Finally, we also report on a generalisation effort made to evaluate our predictive technology over new datasets.

    Comparing Traditional and Neural Approaches for Detecting Health-Related Misinformation

    No full text
    Detecting health-related misinformation is a research challenge that has recently received increasing attention. Helping people to find credible and accurate health information on the Web remains an open research issue, as has been highlighted during the COVID-19 pandemic. However, in such scenarios, it is often critical to detect misinformation quickly [34], which implies working with little data, at least at the beginning of the spread of such information. In this work, we present a comparison between different automatic approaches for identifying misinformation, and we compare how they behave for different tasks and with limited training data. We experiment with traditional algorithms, such as SVMs or KNNs, as well as newer BERT-based models [5]. Our experiments utilise the CLEF 2018 Consumer Health Search task dataset [16] to perform experiments on detecting untrustworthy content and information that is difficult to read. Our results suggest that traditional models are still a strong baseline for these challenging tasks. In the absence of substantive training data, classical approaches tend to outperform BERT-based models.
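    The kind of traditional, data-light baseline the comparison refers to can be illustrated with a minimal term-frequency KNN classifier in pure Python. The snippets, labels, and `knn_predict` helper are hypothetical, a sketch rather than the paper's actual experimental setup, which uses standard SVM/KNN implementations on the CLEF 2018 dataset:

    ```python
    import math
    from collections import Counter

    def tf_vector(text):
        # Bag-of-words term frequencies.
        return Counter(text.lower().split())

    def cosine(a, b):
        # Cosine similarity between two sparse term-frequency vectors.
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def knn_predict(train, text, k=3):
        # train: list of (text, label); vote among the k most similar examples.
        vec = tf_vector(text)
        ranked = sorted(train, key=lambda ex: cosine(vec, tf_vector(ex[0])),
                        reverse=True)
        labels = [label for _, label in ranked[:k]]
        return Counter(labels).most_common(1)[0][0]
    ```

    With only a handful of labelled examples, a lexical-overlap method like this has no parameters to overfit, which is one intuition for why classical approaches can hold up against BERT-based models when training data is scarce.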

    DepreSym: A Depression Symptom Annotated Corpus and the Role of LLMs as Assessors of Psychological Markers

    Full text link
    Computational methods for depression detection aim to mine traces of depression from online publications posted by Internet users. However, solutions trained on existing collections exhibit limited generalisation and interpretability. To tackle these issues, recent studies have shown that identifying depressive symptoms can lead to more robust models. The eRisk initiative fosters research in this area and has recently proposed a new ranking task focused on developing search methods to find sentences related to depressive symptoms. This search challenge relies on the symptoms specified by the Beck Depression Inventory-II (BDI-II), a questionnaire widely used in clinical practice. Based on the participant systems' results, we present the DepreSym dataset, consisting of 21,580 sentences annotated according to their relevance to the 21 BDI-II symptoms. The labelled sentences come from a pool of diverse ranking methods, and the final dataset serves as a valuable resource for advancing the development of models that incorporate depressive markers such as clinical symptoms. Due to the complex nature of this relevance annotation, we designed a robust assessment methodology carried out by three expert assessors (including an expert psychologist). Additionally, we explore here the feasibility of employing recent Large Language Models (ChatGPT and GPT-4) as potential assessors in this complex task. We undertake a comprehensive examination of their performance, determine their main limitations, and analyze their role as a complement or replacement for human annotators.